Multi-Armed Bandit with Python


Presented by Robin Allesiardo from PJD/Search/AI (rallesiardo_at_solocal.com)

February 04, 2020 solocal

AB Testing

Definition of the stationary MAB problem

  • $K$ arms, each associated with a mean reward $\mu_k \in [0,1]$
  • The optimal arm is ${k^* = \arg \max_{k\in K} \mu_k}$
  • The optimal mean reward is ${\mu^* = \max_{k\in K} \mu_k}$
  • At each step $t$, a reward $r_{k_t, t} \in [0,1]$ is drawn from a distribution with mean $\mu_{k_t}$
In [2]:
from numpy.random import binomial

class MAB_Problem:
    def __init__(self, T = 100000, K = 2, delta = 0.1):
        self.T, self.K, self.delta = T, K, delta
        self.mean = [0.5 + (self.delta if i == 0 else 0) for i in range(self.K)]
        """ Caching the reward list """
        self.rewards = [binomial(1, i_mean, int(T)) for i_mean in self.mean]            
        
    def start_game(self):
        """ Move through the game """
        for t in range(int(self.T)):
            self.t = t
            yield int(t)
            
        del self.t
            
    def play(self, k):
        """ Get the reward """
        return self.rewards[k][self.t]

Lower bound on the sample complexity

The minimal number of samples needed for an algorithm to be $\epsilon$-$\delta$-PAC (Probably Approximately Correct) on a MAB problem is:

$$ \Omega \left( \frac{K \log(\frac{1}{\delta})}{\epsilon^2} \right)$$

$\delta$ has a low impact (it only appears inside the logarithm), but the number of samples needed grows quadratically in $1/\epsilon$.

Pseudo Cumulative Regret

$$R(T) = \sum_{t=0}^{T-1} \left(\mu^* - \mu_{k_t}\right) = \sum_{t=0}^{T-1} \Delta_{k_t}$$
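The pseudo-regret can be computed directly from the arm means and the sequence of played arms; a minimal helper sketch (the function name and example values are illustrative):

```python
import numpy as np

def pseudo_regret(means, played_arms):
    """Cumulative pseudo-regret: sum over t of mu* - mu_{k_t}."""
    means = np.asarray(means, dtype=float)
    gaps = means.max() - means              # Delta_k for each arm
    return np.cumsum(gaps[np.asarray(played_arms)])

# Playing the suboptimal arm (index 1) adds its gap Delta = 0.05 each time
regret = pseudo_regret([0.55, 0.5], [0, 1, 1, 0])
```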

Trade-off between Exploration and Exploitation?

Lower bound on the cumulative regret

We know that the mean cumulative regret of any algorithm is at least

$$ \Omega \left( \frac{K \log(\frac{KT}{\Delta})}{\Delta} \right).$$

The problem becomes harder as the gap $\Delta$ gets smaller.

Confidence intervals are our friends

Hoeffding Inequality: the difference between the empirical mean reward of an arm and its true mean is at most $\epsilon$ with probability $1-\delta$; $\epsilon$ being $$\epsilon = \sqrt{\frac{\log\left(\frac{c t^2}{\delta}\right)}{t}}.$$

Fun fact: The empirical mean is an $\epsilon$-$\delta$-PAC estimator.

In [3]:
import math, numpy as np
def get_confidence_bound(t, delta = 0.1, c=1):
    return np.sqrt(np.log(c*pow(t,2)/delta)/t)
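As a quick numeric check of how the bound behaves, the radius shrinks roughly as $1/\sqrt{t}$ (the function is restated so the cell runs standalone):

```python
import numpy as np

def get_confidence_bound(t, delta=0.1, c=1):
    # Same bound as the cell above: sqrt(log(c * t^2 / delta) / t)
    return np.sqrt(np.log(c * t**2 / delta) / t)

# The radius decreases monotonically in t (roughly as 1/sqrt(t))
eps = [get_confidence_bound(t) for t in (10, 100, 1000, 10000)]
```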
(Plot of the confidence bound as a function of $t$; the plotting cell is not included in this export.)

The Random Policy

In [6]:
class RandomBandit():
    
    def __init__(self, K):
        self.K = K
        self.t = 0
        self.counts = [[1] for k in range(K)] # Number of times each arm has been played
        self.rewards = [[0] for k in range(K)] # History of rewards
        self.cumulative_reward = [0] # The global cumulative reward
        
    def select(self):
        return np.random.randint(self.K)
    
    def observe(self, k_t, y_t):
        self.t += 1
        self.cumulative_reward.append(self.cumulative_reward[-1]+y_t)
        for k in range(self.K):
            self.counts[k].append(self.counts[k][-1] + (1 if k==k_t else 0))
            self.rewards[k].append(self.rewards[k][-1] + (y_t if k==k_t else 0))
In [7]:
mab = MAB_Problem(delta = 0.05)
player = RandomBandit(2)
rewards = []
for i in mab.start_game():
    k = player.select()
    y = mab.play(k)
    player.observe(k,y)
    rewards.append(y)
(Plots fig_AB and fig_ABC; the plotting cells are not included in this export.)

Action Elimination

An arm $k$ is eliminated when

$$\hat{\mu}_\text{max} - \hat{\mu}_k > 2\epsilon $$

with $$\epsilon = \sqrt{\frac{\log\left(\frac{c t^2}{\delta}\right)}{t}}.$$
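A minimal sketch of this elimination rule, using the same select/observe interface as the players in this notebook; the round-robin scheduling, the deterministic sanity run, and the `delta`/`c` defaults are implementation choices:

```python
import math
import numpy as np

class ActionElimination:
    """Sketch: play the remaining arms in round-robin and drop an arm
    once its empirical mean lags the leader by more than 2 * eps."""

    def __init__(self, K, delta=0.1, c=1):
        self.K, self.delta, self.c = K, delta, c
        self.active = list(range(K))    # arms not yet eliminated
        self.counts = np.zeros(K)
        self.sums = np.zeros(K)
        self._cursor = 0

    def select(self):
        # Round-robin over the remaining arms
        k = self.active[self._cursor % len(self.active)]
        self._cursor += 1
        return k

    def observe(self, k_t, y_t):
        self.counts[k_t] += 1
        self.sums[k_t] += y_t
        t = self.counts[k_t]
        eps = math.sqrt(math.log(self.c * t**2 / self.delta) / t)
        means = {k: self.sums[k] / max(self.counts[k], 1) for k in self.active}
        best = max(means.values())
        # Keep only the arms within 2 * eps of the empirical leader
        self.active = [k for k in self.active if best - means[k] <= 2 * eps]

# Deterministic sanity run: arm 0 always pays 1, arm 1 always pays 0
player = ActionElimination(2, delta=0.1)
for _ in range(200):
    k = player.select()
    player.observe(k, 1 if k == 0 else 0)
```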

UCB: Upper Confidence Bound

Concept: optimism in the face of uncertainty.

Algorithm: Playing the arm $k$ that maximizes

$$\text{ucb}(k) = \hat{\mu}_k + \sqrt{\frac{2\log(t)}{t_k}}.$$
In [16]:
class UCB():
    
    def __init__(self, K):
        self.K = K
        self.T = 1 # Global timestep
        self.t = np.array([1]*K) # Per-arm pull counts (start at 1)
        self.mean_rewards = np.array([0.]*K) # Per-arm empirical mean rewards

        """Visualization Stuff"""
        self.empirical_rewards = [[] for i in range(self.K)]
        self.confidence_bound = [[] for i in range(self.K)]
        
    def _get_ucb(self, k):
        return self.mean_rewards[k] + math.sqrt(2 * math.log(self.T) / self.t[k])
    
    def select(self):
        ucb = [self._get_ucb(i_k) for i_k in range(self.K)]
        max_ = np.argmax(np.array(ucb))
        """Visualization Stuff"""
        for i_k in range(self.K):
            self.empirical_rewards[i_k].append(self.mean_rewards[i_k])
            self.confidence_bound[i_k].append(ucb[i_k])
        return max_
    
    def observe(self, k_t, y_t):
        
        self.mean_rewards[k_t] = float(y_t + self.t[k_t] * self.mean_rewards[k_t]) / float(self.t[k_t]+1)
        self.t[k_t] += 1
        self.T += 1
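A compact, self-contained run of UCB mirroring the game loop used for the random player; the Bernoulli means (0.7 and 0.3 rather than the notebook's $0.5 \pm \delta$) are illustrative so the gap is easy to see:

```python
import numpy as np

rng = np.random.default_rng(0)
means = [0.7, 0.3]            # illustrative Bernoulli means (assumption)
T, K = 20000, 2
counts = np.ones(K)           # one fictitious initial pull per arm, as in UCB above
mean_rewards = np.zeros(K)

picks = []
for t in range(1, T + 1):
    ucb = mean_rewards + np.sqrt(2 * np.log(t) / counts)
    k = int(np.argmax(ucb))
    y = rng.binomial(1, means[k])
    # Same incremental mean update as UCB.observe
    mean_rewards[k] = (y + counts[k] * mean_rewards[k]) / (counts[k] + 1)
    counts[k] += 1
    picks.append(k)

frac_best = picks.count(0) / T   # fraction of pulls on the better arm
```

With a gap this large, the vast majority of pulls end up on the better arm.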
(Plot fig_ucb of the UCB run; the plotting cell is not included in this export.)
In [22]:
class TS():
    
    def __init__(self, K):
        self.K = K
        self.T = 1
        self.failure = np.array([1]*K) # Beta(1, 1) uniform prior
        self.success = np.array([1]*K)        
                
        """Visualization Stuff"""
        self.failures = [[1] for i in range(self.K)]
        self.successes = [[1] for i in range(self.K)]
        
    def _draw(self, k):
        return np.random.beta(self.success[k],self.failure[k])
    
    def select(self):
        draw = [self._draw(i_k) for i_k in range(self.K)]
        max_ = np.argmax(np.array(draw))
        return max_
    
    def observe(self, k_t, y_t):
        self.failure[k_t] += 1 - y_t
        self.success[k_t] += y_t

        """Visualization Stuff"""
        for i_k in range(self.K):
            self.failures[i_k].append(self.failure[i_k])
            self.successes[i_k].append(self.success[i_k])
In [23]:
mab = MAB_Problem(delta = 0.05)
player = TS(2)
rewards = []
played_arm = []
for i in mab.start_game():
    k = player.select()
    y = mab.play(k)
    player.observe(k,y)
    rewards.append(y)
    played_arm.append(k)
(Plots ts_distributions and fig_ts of the Thompson Sampling run; the plotting cells are not included in this export.)

Non-Stationary Bandits

The mean $\mu_{k,t}$ changes during the run.

Classic algorithms provide no guarantees against non-stationarity.

Algorithms dealing with non-stationarity

  • SER3 (Successive Elimination with Random Round-Robin)
    • Assumption of a positive mean gap $\Delta_t$
    • Regret of $\frac{K \log(T)}{\min_t{\Delta_t}}$
  • SW-UCB (Sliding Window UCB)
    • Piecewise-stationary rewards
    • Regret of $\sqrt{KT}$ (optimal on non-stationary MAB)
  • EXP3
    • Adversarial rewards
    • Randomization against adversity
    • Regret of $\sqrt{KT}$
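EXP3's randomization can be sketched as follows; the exponential-weights update with importance weighting is the core idea, while the $\gamma$ value, seed, and deterministic sanity run are illustrative choices:

```python
import numpy as np

class EXP3:
    """Sketch of EXP3: sample arms from exponential weights and re-weight
    observed rewards by the inverse probability of the played arm."""

    def __init__(self, K, gamma=0.1, seed=0):
        self.K, self.gamma = K, gamma
        self.weights = np.ones(K)
        self.rng = np.random.default_rng(seed)

    def _probs(self):
        w = self.weights / self.weights.sum()
        # Mix in uniform exploration with probability gamma
        return (1 - self.gamma) * w + self.gamma / self.K

    def select(self):
        self.p = self._probs()
        return int(self.rng.choice(self.K, p=self.p))

    def observe(self, k_t, y_t):
        # Importance-weighted estimate keeps the reward estimate unbiased
        x_hat = y_t / self.p[k_t]
        self.weights[k_t] *= np.exp(self.gamma * x_hat / self.K)

# Deterministic sanity run: arm 0 always pays 1, arm 1 always pays 0
player = EXP3(2)
plays = []
for _ in range(2000):
    k = player.select()
    plays.append(k)
    player.observe(k, 1.0 if k == 0 else 0.0)
```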

Contextual Bandits

Rewards depend on context.

Several families of problems:

  • Each arm is a context
    • ($x_0, ..., x_k, ..., x_K$)
    • Contexts allow generalizing rewards across arms
    • e.g.: finding the best hyper-parametrization of an algorithm
  • The context changes at each iteration and the reward depends on it
    • ($x_0, ..., x_t, ..., x_T$)
    • e.g.: users of a website are contexts, ads are arms.
  • Hybrid
    • e.g.: a context at each timestep (a query...) and arms with associated contexts (a result list...)

Contextual Bandit Algorithms

  • State Based Contextual Bandit
    • Contexts are mapped to states
    • One MAB algorithm by state
  • LinUCB
    • Linear model with UCB
  • Contextual Thompson Sampling
    • Scores sampled from a multivariate distribution centered with a linear model
  • $\gamma$-Greedy
    • Exploration with probability $\gamma$, otherwise exploitation with a model trained only on samples gathered during exploration
  • BanditForest
    • Pure exploration until the elimination of the non-optimal splitting criteria, then Successive Elimination in the leaves.
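The state-based approach from the first bullet can be sketched by running one independent UCB player per discretized state; the string-valued states and the deterministic rewards in the sanity run are illustrative assumptions:

```python
import numpy as np
from collections import defaultdict

class StateBasedBandit:
    """Sketch: map each context to a discrete state and run one UCB player per state."""

    def __init__(self, K):
        self.K = K
        # Per-state statistics, created lazily on first visit
        self.counts = defaultdict(lambda: np.ones(K))   # pull counts (start at 1)
        self.means = defaultdict(lambda: np.zeros(K))   # empirical mean rewards
        self.t = defaultdict(lambda: 1)                 # per-state timestep

    def select(self, state):
        c, m, t = self.counts[state], self.means[state], self.t[state]
        ucb = m + np.sqrt(2 * np.log(t) / c)
        return int(np.argmax(ucb))

    def observe(self, state, k_t, y_t):
        c, m = self.counts[state], self.means[state]
        # Same incremental mean update as the stationary UCB player
        m[k_t] = (y_t + c[k_t] * m[k_t]) / (c[k_t] + 1)
        c[k_t] += 1
        self.t[state] += 1

# Two states with opposite best arms, deterministic rewards
player = StateBasedBandit(2)
for i in range(2000):
    s = "a" if i % 2 == 0 else "b"
    k = player.select(s)
    best_arm = 0 if s == "a" else 1
    player.observe(s, k, 1.0 if k == best_arm else 0.0)
```

Each state learns its own best arm independently; nothing is shared between states, which is the limitation the linear-model approaches (LinUCB, contextual Thompson Sampling) address.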